Abstract



We propose a new approach to value function approximation which combines lin- 
ear temporal difference reinforcement learning with subspace identification. In 
practical applications, reinforcement learning (RL) is complicated by the fact that 
state is either high-dimensional or partially observable. Therefore, RL methods 
are designed to work with features of state rather than state itself, and the suc- 
cess or failure of learning is often determined by the suitability of the selected 
features. By comparison, subspace identification (SSID) methods are designed to 
select a feature set which preserves as much information as possible about state. 
In this paper we connect the two approaches, looking at the problem of reinforce- 
ment learning with a large set of features, each of which may only be marginally 
useful for value function approximation. We introduce a new algorithm for this 
situation, called Predictive State Temporal Difference (PSTD) learning. As in 
SSID for predictive state representations, PSTD finds a linear compression op- 
erator that projects a large set of features down to a small set that preserves the 
maximum amount of predictive information. As in RL, PSTD then uses a Bellman 
recursion to estimate a value function. We discuss the connection between PSTD 
and prior approaches in RL and SSID. We prove that PSTD is statistically consis- 
tent, perform several experiments that illustrate its properties, and demonstrate its 
potential on a difficult optimal stopping problem. 

1 Introduction and Related Work 

We examine the problem of estimating a policy's value function within a decision process in a 
high dimensional and partially-observable environment, when the parameters of the process are 
unknown. In this situation, a common strategy is to employ a linear architecture and represent 
the value function as a linear combination of features of (sequences of) observations. A popular 
family of model-free algorithms called temporal difference (TD) algorithms [1] can then be used
to estimate the parameters of the value function. Least-squares TD (LSTD) algorithms [2, 3]
exploit the linearity of the value function to find the optimal parameters in a least-squares sense 
from time-adjacent samples of features. 

Unfortunately, choosing a good set of features is hard. The features must be predictive of future 
reward, and the set of features must be small relative to the amount of training data, or TD learning 
will be prone to overfitting. The problem of selecting a small set of reasonable features has been 
approached from a number of different perspectives. In many domains, the features are selected by 
hand according to expert knowledge; however, this task can be difficult and time consuming in prac- 
tice. Therefore, a considerable amount of research has been devoted to the problem of automatically 
identifying features that support value function approximation. 

Much of this research is devoted to finding sets of features when the dynamical system is known, but 
the state space is large and difficult to work with. For example, in a large fully observable Markov
decision process (MDP), it is often easier to estimate the value function from a low dimensional set
of features than by using state directly. So, several approaches attempt to automatically discover a 
small set of features from a given larger description of an MDP, e.g., by using a spectral analysis 
of the state-space transition graph to discover a low-dimensional feature set that preserves the graph 
structure 000. 

Partially observable Markov decision processes (POMDPs) extend MDPs to situations where the 
state is not directly observable [8, 9, 10]. In this circumstance, an agent can plan using a continuous
belief state with dimensionality equal to the number of hidden states in the POMDP. When the number
of hidden states is large, dimensionality reduction in POMDPs can be achieved by projecting a
high dimensional belief space to a lower dimensional one; of course, the difficulty is to find a projection
which preserves decision quality. Strategies for finding good projections include value-directed
compression [11] and non-negative matrix factorization [12, 13]. The resulting model after compression
is a Predictive State Representation (PSR) [14, 15], an Observable Operator Model [16], or a
multiplicity automaton [17]. Moving to one of these representations can often compress a POMDP
by a large factor with little or no loss in accuracy: examples exist with arbitrarily large lossless 
compression factors, and in practice, we can often achieve large compression ratios with little loss. 

The drawback of all of the approaches enumerated above is that they first assume that the dynamical 
system model is known, and only then give us a way of finding a compact representation and a 
value function. In practice, we would like to be able to find a good set of features, without prior 
knowledge of the system model. Kolter and Ng [18] contend with this problem from a sparse feature
selection standpoint. Given a large set of possibly-relevant features of observations, they proposed
augmenting LSTD by applying an $L_1$ penalty to the coefficients, forcing LSTD to select a sparse set
of features for value function estimation. The resulting algorithm, LARS-TD, works well in certain
situations (for example, see Section 5.1), but only if our original large set of features contains a small
subset of highly-relevant features.

Recently, Parr et al. looked at the problem of value function estimation from the perspective of 
both model-free and model-based reinforcement learning [19]. The model-free approach estimates
a value function directly from sample trajectories, i.e., from sequences of feature vectors of visited 
states. The model-based approach, by contrast, first learns a model and then computes the value 
function from the learned model. Parr et al. compared LSTD (a model-free method) to a model- 
based method in which we first learn a linear model by viewing features as a proxy for state (leading 
to a linear transition matrix that predicts future features from past features), and then compute a 
value function from this approximate model. Parr et al. demonstrated that these two approaches 
compute exactly the same value function [19], formalizing a fact that has been recognized to some
degree before [5].

In the current paper, we build on this insight, while simultaneously finding a compact set of features 
using powerful methods from system identification. First, we look at the problem of improving 
LSTD from a model-free predictive-bottleneck perspective: given a large set of features of history, 
we devise a new TD method called Predictive State Temporal Difference (PSTD) learning that estimates
the value function through a bottleneck that preserves only predictive information (Section 3).
Intuitively, this approach has some of the same benefits as LARS-TD: by finding a small set of pre- 
dictive features, we avoid overfitting and make learning more data-efficient. However, our method 
differs in that we identify a small subspace of features instead of a sparse subset of features. Hence, 
PSTD and LARS-TD are applicable in different situations: as we show in our experiments below, 
PSTD is better when we have many marginally-relevant features, while LARS-TD is better when 
we have a few highly-relevant features hidden among many irrelevant ones. 

Second, we look at the problem of value function estimation from a model-based perspective (Section 4).
Instead of learning a linear transition model from features, as in [19], we use subspace
identification [20, 21] to learn a PSR from our samples. Then we compute a value function via
the Bellman equations for our learned PSR. This new approach has a substantial benefit: while the 
linear feature-to-feature transition model of [19] does not seem to have any common uses outside 
that paper, PSRs have been proposed numerous times on their own merits (including being invented 
independently at least three times), and are a strict generalization of POMDPs. 

Just as Parr et al. showed for the two simpler methods, we show that our two improved methods 
(model-free and model-based) are equivalent. This result yields some appealing theoretical benefits:
for example, PSTD features can be explicitly interpreted as a statistically consistent estimate of the
true underlying system state. And, the feasibility of finding the true value function can be shown
to depend on the linear dimension of the dynamical system, or equivalently, the dimensionality of
the predictive state representation, not on the cardinality of the POMDP state space. Therefore our
representation is naturally "compressed" in the sense of [11], speeding up convergence.

The improved methods also yield practical benefits; we demonstrate these benefits with several ex- 
periments. First, we compare PSTD to LSTD and LARS-TD on a synthetic example using different 
sets of features to illustrate the strengths and weaknesses of each algorithm. Next, we apply PSTD 
to a difficult optimal stopping problem for pricing high-dimensional financial derivatives. A signif- 
icant amount of work has gone into hand tuning features for this problem. We show that, if we add 
a large number of weakly relevant features to these hand-tuned features, PSTD can find a predictive 
subspace which performs much better than competing approaches, improving on the best previously 
reported result for this particular problem by a substantial margin. 

The theoretical and empirical results reported here suggest that, for many applications where LSTD 
is used to compute a value function, PSTD can be simply substituted to produce better results. 

2 Value Function Approximation 

We start from a discrete time dynamical system with a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a distribution
over initial states $\pi_0$, a state transition function $T$, a reward function $\mathcal{R}$, and a discount factor
$\gamma \in [0, 1]$. We seek a policy $\pi$, a mapping from states to actions. The notion of a value function is of
central importance in reinforcement learning: for a given policy $\pi$, the value of state $s$ is defined
as the expected discounted sum of rewards obtained when starting in state $s$ and following policy
$\pi$: $J^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t) \mid s_0 = s, \pi\right]$. It is well known that the value function must obey the
Bellman equation

$$J^\pi(s) = \mathcal{R}(s) + \gamma \sum_{s'} J^\pi(s') \Pr[s' \mid s, \pi(s)] \qquad (1)$$

If we know the transition function $T$, and if the set of states $\mathcal{S}$ is sufficiently small, we can use (1)
directly to solve for the value function $J^\pi$. We can then execute the greedy policy for $J^\pi$, setting the
action at each state to maximize the right-hand side of (1).
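As a minimal sketch of the known-model case just described: for a fixed policy, Equation (1) is a linear system in $J^\pi$ and can be solved directly when the state set is small. The three-state chain, its rewards, and the function name `policy_value` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def policy_value(P, R, gamma):
    """Solve J = R + gamma * P @ J for a fixed policy.

    P[s, s2] = Pr[s2 | s, pi(s)] (policy-induced transition matrix),
    R[s]     = immediate reward in state s.
    Assumes gamma < 1 so that (I - gamma * P) is invertible.
    """
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Hypothetical 3-state chain; the last state is absorbing with reward 1.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])
gamma = 0.9
J = policy_value(P, R, gamma)
```

The absorbing state satisfies $J = 1 + 0.9\,J$, so its value is 10; the remaining entries follow from the same linear system.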

However, we consider instead the harder problem of estimating the value function when $s$ is a partially
observable latent variable, and when the transition function $T$ is unknown. In this situation,
we receive information about $s$ through observations from a finite set $\mathcal{O}$. Our state (i.e., the information
which we can use to make decisions) is not an element of $\mathcal{S}$ but a history (an ordered sequence
of action-observation pairs $h = a_1^h o_1^h \ldots a_t^h o_t^h$ that have been executed and observed prior to time
$t$). If we knew the transition model $T$, we could use $h$ to infer a belief distribution over $\mathcal{S}$, and
use that belief (or a compression of that belief) as a state instead; below, we will discuss how to
learn a compressed belief state. Because of partial observability, we can only hope to predict reward
conditioned on history, $\mathcal{R}(h) = \mathbb{E}[\mathcal{R}(s) \mid h]$, and we must choose actions as a function of history,
$\pi(h)$ instead of $\pi(s)$.

Let $\mathcal{H}$ be the set of all possible histories. $\mathcal{H}$ is often very large or infinite, so instead of finding a
value separately for each history, we focus on value functions that are linear in features of histories

$$J^\pi(s) = w^\top \phi^{\mathcal{H}}(h) \qquad (2)$$

Here $w \in \mathbb{R}^j$ is a parameter vector and $\phi^{\mathcal{H}}(h) \in \mathbb{R}^j$ is a feature vector for a history $h$. So, we can
rewrite the Bellman equation as

$$w^\top \phi^{\mathcal{H}}(h) = \mathcal{R}(h) + \gamma \sum_{o \in \mathcal{O}} w^\top \phi^{\mathcal{H}}(h\pi o) \Pr[h\pi o \mid h] \qquad (3)$$

where $h\pi o$ is history $h$ extended by taking action $\pi(h)$ and observing $o$.

2.1 Least Squares Temporal Difference Learning

In general we don't know the transition probabilities $\Pr[h\pi o \mid h]$, but we do have samples of state
features $\phi^{\mathcal{H}}_t = \phi^{\mathcal{H}}(h_t)$, next-state features $\phi^{\mathcal{H}}_{t+1} = \phi^{\mathcal{H}}(h_{t+1})$, and immediate rewards $\mathcal{R}_t = \mathcal{R}(h_t)$.
We can thus estimate the Bellman equation

$$w^\top \phi^{\mathcal{H}}_{1:k} \approx \mathcal{R}_{1:k} + \gamma\, w^\top \phi^{\mathcal{H}}_{2:k+1} \qquad (4)$$

(Here we have used the notation $\phi^{\mathcal{H}}_{1:k}$ to mean the matrix whose columns are $\phi^{\mathcal{H}}_t$ for $t = 1 \ldots k$.)
We can immediately attempt to estimate the parameter $w$ by solving the linear system in the
least squares sense: $w^\top = \mathcal{R}_{1:k} \left(\phi^{\mathcal{H}}_{1:k} - \gamma \phi^{\mathcal{H}}_{2:k+1}\right)^{\dagger}$, where $\dagger$ indicates the Moore-Penrose
pseudo-inverse. However, this solution is biased [3], since the independent variables $\phi^{\mathcal{H}}_t - \gamma \phi^{\mathcal{H}}_{t+1}$ are noisy
samples of the expected difference $\mathbb{E}\left[\phi^{\mathcal{H}}(h) - \gamma \sum_{o \in \mathcal{O}} \phi^{\mathcal{H}}(h\pi o) \Pr[h\pi o \mid h]\right]$. In other words,
estimating the value function parameters $w$ is an error-in-variables problem.

The least squares temporal difference (LSTD) algorithm provides a consistent estimate of the independent
variables by right multiplying the approximate Bellman equation (Equation 4) by $\phi^{\mathcal{H}\top}_{1:k}$.
The quantity $\phi^{\mathcal{H}\top}_{1:k}$ can be viewed as an instrumental variable [3], i.e., a measurement that is correlated
with the true independent variables, but uncorrelated with the noise in our estimates of these
variables.$^1$ The value function parameter $w$ may then be estimated as follows:

$$w^\top = \frac{1}{k} \sum_{t=1}^{k} \mathcal{R}_t \phi^{\mathcal{H}\top}_t \left( \frac{1}{k} \sum_{t=1}^{k} \phi^{\mathcal{H}}_t \phi^{\mathcal{H}\top}_t - \frac{\gamma}{k} \sum_{t=1}^{k} \phi^{\mathcal{H}}_{t+1} \phi^{\mathcal{H}\top}_t \right)^{-1} \qquad (5)$$

As the amount of data $k$ increases, the empirical covariance matrices $\phi^{\mathcal{H}}_{1:k} \phi^{\mathcal{H}\top}_{1:k} / k$ and
$\phi^{\mathcal{H}}_{2:k+1} \phi^{\mathcal{H}\top}_{1:k} / k$ converge with probability 1 to their population values, and so our estimate of the
matrix to be inverted in (5) is consistent. Therefore, as long as this matrix is nonsingular, our estimate
of the inverse is also consistent, and our estimate of $w$ therefore converges to the true parameters
with probability 1.
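The LSTD estimate of Equation (5) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `lstd`, the deterministic three-state cycle, and the one-hot features are all assumptions chosen so that the result can be checked against the exact value function.

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma):
    """LSTD estimate of w, following the form of Equation (5).

    phi, phi_next: (d, k) arrays whose columns are features at times t and t+1.
    rewards:       (k,) array of immediate rewards R_t.
    """
    k = phi.shape[1]
    A = (phi @ phi.T - gamma * phi_next @ phi.T) / k   # matrix to be inverted
    b = (rewards @ phi.T) / k                          # (1/k) sum_t R_t phi_t^T
    # w^T = b A^{-1} is equivalent to A^T w = b.
    return np.linalg.solve(A.T, b)

# Hypothetical data: a deterministic cycle 0 -> 1 -> 2 -> 0 with one-hot features.
states = np.arange(7) % 3
phi_all = np.eye(3)[:, states]                         # column t is the one-hot of s_t
rewards = np.array([0.0, 0.0, 1.0])[states[:-1]]       # reward 1 in state 2
w = lstd(phi_all[:, :-1], phi_all[:, 1:], rewards, gamma=0.9)
```

With one-hot features and a deterministic system there is no sampling noise, so `w` coincides with the exact solution of the Bellman equation for this toy chain.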

3 Predictive Features 

Although LSTD provides a consistent estimate of the value function parameters w, in practice, the 
potential size of the feature vectors can be a problem. If the number of features is large relative to 
the number of training samples, then the estimation of w is prone to overfitting. This problem can 
be alleviated by choosing some small set of features that only contain information that is relevant 
for value function approximation. However, with the exception of LARS-TD [18], there has been
little work on the problem of how to select features automatically for value function approximation 
when the system model is unknown; and of course, manual feature selection depends on not-always- 
available expert guidance. 

We approach the problem of finding a good set of features from a bottleneck perspective. That 
is, given some signal from history, in this case a large set of features, we would like to find a 
compression that preserves only relevant information for predicting the value function $J^\pi$. As we
will see in Section 4, this improvement is directly related to spectral identification of PSRs.

3.1 Tests and Features of the Future 

We first need to define precisely the task of predicting the future. Just as a history is an ordered
sequence of action-observation pairs executed prior to time $t$, we define a test of length $i$ to be an
ordered sequence of action-observation pairs $\tau = a_1 o_1 \ldots a_i o_i$ that can be executed and observed
after time $t$ [14]. The prediction for a test $\tau$ after a history $h$, written $\tau(h)$, is the probability that we
will see the test observations $\tau^O = o_1 \ldots o_i$, given that we intervene [22] to execute the test actions
$\tau^A = a_1 \ldots a_i$:

$$\tau(h) = \Pr[\tau^O \mid h, \mathrm{do}(\tau^A)]$$

If $Q = \{\tau_1, \ldots, \tau_n\}$ is a set of tests, we write $Q(h) = (\tau_1(h), \ldots, \tau_n(h))^\top$ for the corresponding
vector of test predictions.

We can generalize the notion of a test to a feature of the future, a linear combination of several tests
sharing a common action sequence. For example, if $\tau_1$ and $\tau_2$ are two tests with $\tau_1^A = \tau_2^A \equiv \tau^A$,
then we can make a feature $\phi \equiv 3\tau_1 + \tau_2$. This feature is executed if we intervene to $\mathrm{do}(\tau^A)$, and if
it is executed its value is $3\,\mathbb{I}(\tau_1^O) + \mathbb{I}(\tau_2^O)$, where $\mathbb{I}(o_1 \ldots o_i)$ stands for an indicator random variable,
taking the value 0 or 1 depending on whether we observe the sequence of observations $o_1 \ldots o_i$. The
prediction of $\phi$ given $h$ is $\phi(h) \equiv \mathbb{E}(\phi \mid h, \mathrm{do}(\tau^A)) = 3\tau_1(h) + \tau_2(h)$.

$^1$ The LSTD algorithm can also be theoretically justified as the result of an application of the Bellman
operator followed by an orthogonal projection back onto the row space of $\phi^{\mathcal{H}}$ [4].

While linear combinations of tests may seem restrictive, our definition is actually very expressive: 
we can represent an arbitrary function of a finite sequence of future observations. To do so, we take a 
collection of tests, each of which picks out one possible realization of the sequence, and weight each 
test by the value of the function conditioned on that realization. For example, if our observations are
integers $1, 2, \ldots, 10$, we can write the square of the next observation as $\sum_{o=1}^{10} o^2\, \mathbb{I}(o)$, and the mean
of the next two observations as $\sum_{o=1}^{10} \sum_{o'=1}^{10} \frac{1}{2}(o + o')\, \mathbb{I}(o, o')$.
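The two example features above can be written out concretely as weighted collections of tests. The sketch below is illustrative only: the dictionary representation, the helper names, and the uniform observation model used for the predictions are assumptions, not part of the paper.

```python
import itertools

OBS = range(1, 11)  # observations are the integers 1..10

# A feature is a dict: observation sequence of one test -> weight of that test.
square_next = {(o,): float(o ** 2) for o in OBS}
mean_next_two = {(o, o2): 0.5 * (o + o2) for o, o2 in itertools.product(OBS, OBS)}

def feature_value(feature, realized):
    """Realized value: sum over tests of weight * indicator(test observations occurred)."""
    realized = tuple(realized)
    return sum(w * (seq == realized) for seq, w in feature.items())

def feature_prediction(feature, test_prob):
    """Prediction phi(h): the same linear combination applied to test predictions tau(h)."""
    return sum(w * test_prob(seq) for seq, w in feature.items())

def uniform(seq):
    """Hypothetical test predictions: i.i.d. uniform observations over 1..10."""
    return 0.1 ** len(seq)

sq_value = feature_value(square_next, [7])          # square of a realized next observation
mean_value = feature_value(mean_next_two, [3, 8])   # mean of two realized observations
sq_pred = feature_prediction(square_next, uniform)
mean_pred = feature_prediction(mean_next_two, uniform)
```

Exactly one indicator fires for any realized sequence, so the realized value equals the function of the observations; the prediction replaces each indicator by the corresponding test probability.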

The restriction to a common action sequence is necessary: without this restriction, all the tests 
making up a feature could never be executed at once. Once we move to feature predictions, however, 
it makes sense to lift this restriction: we will say that any linear combination of feature predictions 
is also a feature prediction, even if the features involved have different action sequences. 

Action sequences raise some problems with obtaining empirical estimates of means and covariances 
of features of the future: e.g., it is not always possible to get a sample of a particular feature's value 
on every time step, and the feature we choose to sample at one step can restrict which features we 
can sample at subsequent steps. In order to carry out our derivations without running into these 
problems repeatedly, we will assume for the rest of the paper that we can reset our system after 
every sample, and get a new history independently distributed as $h_t \sim \omega$ for some distribution $\omega$.
(With some additional bookkeeping we could remove this assumption [23], but this bookkeeping
would unnecessarily complicate our derivations.) 

Furthermore, we will introduce some new language, again to keep derivations simple: if we have a
vector of features of the future $\phi^{\mathcal{T}}$, we will pretend that we can get a sample $\phi^{\mathcal{T}}_t$ in which we evaluate
all of our features starting from a single history $h_t$, even if the different elements of $\phi^{\mathcal{T}}$ require us
to execute different action sequences. When our algorithms call for such a sample, we will instead
use the following trick to get a random vector with the correct expectation (and somewhat higher
variance, which doesn't matter for any of our arguments): write $\tau_1^A, \tau_2^A, \ldots$ for the different action
sequences, and let $\zeta_1, \zeta_2, \ldots > 0$ be a probability distribution over these sequences. We pick a single
action sequence $\tau_a^A$ according to $\zeta$, and execute $\tau_a^A$ to get a sample $\hat{\phi}^{\mathcal{T}}$ of the features which depend
on $\tau_a^A$. We then enter $\hat{\phi}^{\mathcal{T}} / \zeta_a$ into the corresponding coordinates of $\phi^{\mathcal{T}}_t$, and fill in zeros everywhere
else. It is easy to see that the expected value of our sample vector is then correct: the probability of
selection $\zeta_a$ and the weighting factor $1/\zeta_a$ cancel out. We will write $\mathbb{E}(\phi^{\mathcal{T}} \mid h_t, \mathrm{do}(\zeta))$ to stand for
this expectation.
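The reweighting trick can be sketched as follows. The block layout, the selection probabilities, and the feature means are made-up values for illustration, and `weighted_future_sample` is a hypothetical helper name; the point is only that dividing by $\zeta_a$ makes the zero-padded sample correct in expectation.

```python
import numpy as np

def weighted_future_sample(blocks, zeta, a, block_sample):
    """Build the full feature-of-the-future vector from one executed action sequence.

    blocks[a]   : indices of the coordinates whose features depend on sequence a
    zeta[a]     : probability of selecting action sequence a
    block_sample: realized values of those features after executing sequence a
    """
    dim = sum(len(b) for b in blocks)
    out = np.zeros(dim)                                   # zeros everywhere else
    out[blocks[a]] = np.asarray(block_sample) / zeta[a]   # reweight by 1 / zeta_a
    return out

# Hypothetical setup: 5 features split across 2 action sequences.
blocks = [np.array([0, 1]), np.array([2, 3, 4])]
zeta = np.array([0.25, 0.75])
true_means = np.array([0.2, 0.4, 0.1, 0.3, 0.6])          # made-up E[phi | h, do(.)]

rng = np.random.default_rng(0)
a = rng.choice(len(blocks), p=zeta)                       # pick one sequence according to zeta
sample = weighted_future_sample(blocks, zeta, a, true_means[blocks[a]])

# Exact expectation over the choice of a (using noise-free block values):
expected = sum(zeta[i] * weighted_future_sample(blocks, zeta, i, true_means[blocks[i]])
               for i in range(len(blocks)))
```

Each individual sample has zeros in every block that was not executed, which is the source of the extra variance mentioned above.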

None of the above tricks are actually necessary in our experiments with stopping problems: we 
simply execute the "continue" action on every step, and use only sequences of "continue" actions in 
every test and feature. 



3.2 Finding Predictive Features Through a Bottleneck 

In order to find a predictive feature compression, we first need to determine what we would like 
to predict. Since we are interested in value function approximation, the most relevant prediction is 
the value function itself; so, we could simply try to predict total future discounted reward given a 
history. Unfortunately, total discounted reward has high variance, so unless we have a lot of data, 
learning will be difficult. 

We can reduce variance by including other prediction tasks as well. For example, predicting indi- 
vidual rewards at future time steps, while not strictly necessary to predict total discounted reward, 
seems highly relevant, and gives us much more immediate feedback. Similarly, future observations 
hopefully contain information about future reward, so trying to predict observations can help us pre- 
dict reward better. Finally, in any specific RL application, we may be able to add problem-specific 
prediction tasks that will help focus our attention on relevant information: for example, in a path- 
planning problem, we might try to predict which of several goal states we will reach (in addition to 
how much it will cost to get there).

We can represent all of these prediction tasks as features of the future: e.g., to predict which goal we
will reach, we add a distinct observation at each goal state, or to predict individual rewards, we add
individual rewards as observations.$^2$ We will write $\phi^{\mathcal{T}}_t$ for the vector of all features of the "future at
time $t$," i.e., events starting at time $t + 1$ and continuing forward.

So, instead of remembering a large arbitrary set of features of history, we want to find a small 
subspace of features of history that is relevant for predicting features of the future. We will call this 
subspace a predictive compression, and we will write the value function as a linear function of only 
the predictive compression of features. 



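To find our predictive compression, we will use reduced-rank regression [24]. We define the following
empirical covariance matrices between features of the future and features of histories:

$$\widehat{\Sigma}_{\mathcal{T},\mathcal{H}} = \frac{1}{k} \sum_{t=1}^{k} \phi^{\mathcal{T}}_t \phi^{\mathcal{H}\top}_t \qquad \widehat{\Sigma}_{\mathcal{H},\mathcal{H}} = \frac{1}{k} \sum_{t=1}^{k} \phi^{\mathcal{H}}_t \phi^{\mathcal{H}\top}_t \qquad (6)$$

Let $L_{\mathcal{H}}$ be the lower triangular Cholesky factor of $\widehat{\Sigma}_{\mathcal{H},\mathcal{H}}$. Then we can find a predictive compression
of histories by a singular value decomposition (SVD) of the weighted covariance: write

$$\mathcal{U} \mathcal{D} \mathcal{V}^\top \approx \widehat{\Sigma}_{\mathcal{T},\mathcal{H}} L_{\mathcal{H}}^{-\top} \qquad (7)$$

for a truncated SVD [25] of the weighted covariance, where $\mathcal{U}$ are the left singular vectors, $\mathcal{V}^\top$ are
the right singular vectors, and $\mathcal{D}$ is the diagonal matrix of singular values. The number of columns
of $\mathcal{U}$, $\mathcal{V}$, or $\mathcal{D}$ is equal to the number of retained singular values.$^3$ Then we define

$$\widehat{U} = \mathcal{U} \mathcal{D}^{1/2} \qquad (8)$$

to be the mapping from the low-dimensional compressed space up to the high-dimensional space of
features of the future.

Given $\widehat{U}$, we would like to find a compression operator $V$ that optimally predicts features of the
future through the bottleneck defined by $\widehat{U}$. The least squares estimate can be found by minimizing
the loss

$$\mathcal{L}(V) = \left\| \phi^{\mathcal{T}}_{1:k} - \widehat{U} V \phi^{\mathcal{H}}_{1:k} \right\|_F^2 \qquad (9)$$

where $\| \cdot \|_F$ denotes the Frobenius norm. We can find the minimum by taking the derivative of this
loss with respect to $V$, setting it to zero, and solving for $V$ (see Appendix, Section A for details),
giving us:

$$\widehat{V} = \arg\min_V \mathcal{L}(V) = \widehat{U}^\top \widehat{\Sigma}_{\mathcal{T},\mathcal{H}} \left(\widehat{\Sigma}_{\mathcal{H},\mathcal{H}}\right)^{-1} \qquad (10)$$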
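The compression steps of Equations (6) through (10) translate fairly directly into NumPy. The sketch below follows the equations as printed; it is not the authors' code. It assumes the empirical history covariance is positive definite (otherwise the Cholesky factorization fails and some regularization, not discussed here, would be needed), and the synthetic two-dimensional latent state is a made-up test case.

```python
import numpy as np

def predictive_compression(phi_T, phi_H, rank):
    """Return (U_hat, V_hat) following Equations (6)-(10).

    phi_T: (m, k) features of the future, one column per sample.
    phi_H: (l, k) features of history, one column per sample.
    rank:  number of singular values to retain.
    """
    k = phi_H.shape[1]
    S_TH = phi_T @ phi_H.T / k                        # Eq. 6
    S_HH = phi_H @ phi_H.T / k                        # Eq. 6
    L_H = np.linalg.cholesky(S_HH)                    # S_HH = L_H L_H^T, L_H lower triangular
    weighted = np.linalg.solve(L_H, S_TH.T).T         # S_TH L_H^{-T}
    U, d, _ = np.linalg.svd(weighted, full_matrices=False)   # Eq. 7
    U_hat = U[:, :rank] * np.sqrt(d[:rank])           # Eq. 8: U D^{1/2}, truncated
    V_hat = U_hat.T @ np.linalg.solve(S_HH, S_TH.T).T # Eq. 10: U_hat^T S_TH S_HH^{-1}
    return U_hat, V_hat

# Hypothetical data: a 2-dimensional latent state drives both feature sets.
rng = np.random.default_rng(0)
s = rng.normal(size=(2, 500))
phi_H = rng.normal(size=(6, 2)) @ s + 0.1 * rng.normal(size=(6, 500))
phi_T = rng.normal(size=(5, 2)) @ s
U_hat, V_hat = predictive_compression(phi_T, phi_H, rank=2)
compressed = V_hat @ phi_H                            # 2 predictive features per sample
```

In this toy case the cross covariance has rank 2, so regressing the future features on the two compressed features fits as well as regressing on all six history features.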
By weighting different features of the future differently, we can change the approximate compression
in interesting ways. For example, as we will see in Section 4.2, scaling up future reward by a constant
factor results in a value-directed compression; unlike previous ways to find value-directed
compressions [11], however, we do not need to know a model of our system ahead of time. For another
example, define $L_{\mathcal{T}}$ to be the lower triangular Cholesky factor of the empirical covariance of future
features $\widehat{\Sigma}_{\mathcal{T},\mathcal{T}}$. Then, if we scale features of the future by $L_{\mathcal{T}}^{-\top}$, the singular value decomposition
will preserve the largest possible amount of mutual information between features of the future and
features of history. This is equivalent to canonical correlation analysis [26, 27], and the matrix $\mathcal{D}$
becomes a diagonal matrix of canonical correlations between futures and histories.



$^2$ If we don't wish to reveal extra information by adding additional observations, we can instead add the
corresponding feature predictions as observations; these predictions, by definition, reveal no additional
information. To save the trouble of computing these predictions, we can use realized feature values rather than
predictions in our learning algorithms below, at the cost of some extra variance: the expectation of the realized
feature value is the same as the expectation of the predicted feature value.

$^3$ If our empirical estimate $\widehat{\Sigma}_{\mathcal{T},\mathcal{H}}$ were exact, we could keep all nonzero singular values to find the smallest
possible compression that does not lose any predictive power. In practice, though, there will be noise in our
estimate, and $\widehat{\Sigma}_{\mathcal{T},\mathcal{H}} L_{\mathcal{H}}^{-\top}$ will be full rank. If we know the true rank $n$ of $\Sigma_{\mathcal{T},\mathcal{H}}$, we can choose the first $n$
singular values to define a subspace for compression. Or, we can choose a smaller subspace that results in an
approximate compression: by selectively dropping columns of $\mathcal{U}$ corresponding to small singular values, we can
trade off compression against predictive power. Directions of larger variance in features of the future correspond
to larger singular values in the SVD, so we minimize prediction error by truncating the smallest singular values.
By contrast with an SVD of the unscaled covariance, we do not attempt to minimize reconstruction error for
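features of history, since features of history are standardized when we multiply by the inverse Cholesky factor.

3.3 Predictive State Temporal Difference Learning

Now that we have found a predictive compression operator $\widehat{V}$ via Equation 10, we can replace the
features of history $\phi^{\mathcal{H}}_t$ with the compressed features $\widehat{V} \phi^{\mathcal{H}}_t$ in the Bellman recursion, Equation 4.
Doing so results in the following approximate Bellman equation:

$$w^\top \widehat{V} \phi^{\mathcal{H}}_{1:k} \approx \mathcal{R}_{1:k} + \gamma\, w^\top \widehat{V} \phi^{\mathcal{H}}_{2:k+1} \qquad (11)$$

The least squares solution for $w$ is still prone to an error-in-variables problem. The variable $\phi^{\mathcal{H}}$
is still correlated with the true independent variables and uncorrelated with noise, and so we can
again use it as an instrumental variable to unbias the estimate of $w$. Define the additional empirical
covariance matrices:

$$\widehat{\Sigma}_{\mathcal{R},\mathcal{H}} = \frac{1}{k} \sum_{t=1}^{k} \mathcal{R}_t \phi^{\mathcal{H}\top}_t \qquad \widehat{\Sigma}_{\mathcal{H}^+,\mathcal{H}} = \frac{1}{k} \sum_{t=1}^{k} \phi^{\mathcal{H}}_{t+1} \phi^{\mathcal{H}\top}_t \qquad (12)$$

Then, the corrected Bellman equation is:

$$w^\top \widehat{V} \widehat{\Sigma}_{\mathcal{H},\mathcal{H}} = \widehat{\Sigma}_{\mathcal{R},\mathcal{H}} + \gamma\, w^\top \widehat{V} \widehat{\Sigma}_{\mathcal{H}^+,\mathcal{H}}$$

and solving for $w$ gives us the Predictive State Temporal Difference (PSTD) learning algorithm:

$$w^\top = \widehat{\Sigma}_{\mathcal{R},\mathcal{H}} \left( \widehat{V} \widehat{\Sigma}_{\mathcal{H},\mathcal{H}} - \gamma \widehat{V} \widehat{\Sigma}_{\mathcal{H}^+,\mathcal{H}} \right)^{\dagger} \qquad (13)$$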
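Given a compression operator, the PSTD estimate of Equation (13) is a small change to LSTD: the covariances are premultiplied by $\widehat{V}$ and the inverse becomes a pseudo-inverse. The following is a sketch under stated assumptions, not the authors' implementation. The toy cycle, the one-hot features, and the invertible stand-in for $\widehat{V}$ are hypothetical; they are chosen so that the compressed estimate can be checked against the exact values.

```python
import numpy as np

def pstd_weights(V_hat, phi_H, phi_H_next, rewards, gamma):
    """PSTD estimate of w (as a 1-D array), following Equation (13).

    V_hat:             (d, l) compression operator.
    phi_H, phi_H_next: (l, k) history features at times t and t+1.
    rewards:           (k,) immediate rewards.
    """
    k = phi_H.shape[1]
    S_HH = phi_H @ phi_H.T / k            # empirical covariance of history features
    S_HpH = phi_H_next @ phi_H.T / k      # covariance of next and current features
    S_RH = rewards @ phi_H.T / k          # covariance of rewards and features
    M = V_hat @ S_HH - gamma * V_hat @ S_HpH
    return S_RH @ np.linalg.pinv(M)

def pstd_value(w, V_hat, phi_h):
    """Value estimate w^T V_hat phi(h) for a single feature vector."""
    return w @ V_hat @ phi_h

# Hypothetical check: deterministic cycle 0 -> 1 -> 2 -> 0 with one-hot features.
states = np.arange(7) % 3
phi = np.eye(3)[:, states]
rewards = np.array([0.0, 0.0, 1.0])[states[:-1]]
V_hat = np.array([[2.0, 0.0, 0.0],        # stand-in invertible "compression"
                  [1.0, 1.0, 0.0],
                  [0.0, 0.0, 3.0]])
w = pstd_weights(V_hat, phi[:, :-1], phi[:, 1:], rewards, gamma=0.9)
values = np.array([pstd_value(w, V_hat, np.eye(3)[:, s]) for s in range(3)])
```

In practice `V_hat` would have far fewer rows than columns; the square invertible matrix here only demonstrates that the estimate is unchanged when no information is discarded.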



So far we have provided some intuition for why predictive features should be better than arbitrary 
features for temporal difference learning. Below we will show an additional benefit: the model-free
algorithm in Equation 13 is, under some circumstances, equivalent to a model-based value
function approximation method which uses subspace identification to learn Predictive State
Representations [20, 21].



4 Predictive State Representations 

A predictive state representation (PSR) [14] is a compact and complete description of a dynamical
system. Unlike POMDPs, which represent state as a distribution over a latent variable, PSRs
represent state as a set of predictions of tests. 

Formally, a PSR consists of five elements $(\mathcal{A}, \mathcal{O}, Q, s_1, F)$. $\mathcal{A}$ is a finite set of possible actions,
and $\mathcal{O}$ is a finite set of possible observations. $Q$ is a core set of tests, i.e., a set whose vector of
predictions $Q(h)$ is a sufficient statistic for predicting the success probabilities of all tests. $F$ is
the set of functions $f_\tau$ which embody these predictions: $\tau(h) = f_\tau(Q(h))$. And, $m_1 = Q(\epsilon)$ is
the initial prediction vector. In this work we will restrict ourselves to linear PSRs, in which all
prediction functions are linear: $f_\tau(Q(h)) = r_\tau^\top Q(h)$ for some vector $r_\tau \in \mathbb{R}^{|Q|}$. Finally, a core set
$Q$ for a linear PSR is said to be minimal if the tests in $Q$ are linearly independent [16, 13], i.e., no
one test's prediction is a linear function of the other tests' predictions.

Since $Q(h)$ is a sufficient statistic for all tests, it is a state for our PSR: i.e., we can remember just
$Q(h)$ instead of $h$ itself. After action $a$ and observation $o$, we can update $Q(h)$ recursively: if we
write $M_{ao}$ for the matrix with rows $r_{ao\tau}^\top$ for $\tau \in Q$, then we can use Bayes' Rule to show:

$$Q(hao) = \frac{M_{ao} Q(h)}{\Pr[o \mid h, \mathrm{do}(a)]} = \frac{M_{ao} Q(h)}{m_\infty^\top M_{ao} Q(h)} \qquad (14)$$

where $m_\infty$ is a normalizer, defined by $m_\infty^\top Q(h) = 1$ for all $h$.
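The update in Equation (14) can be sketched as below. To obtain concrete numbers, the example uses a two-state hidden Markov model with a single action, written in observable-operator form; this is an assumption made for illustration. The state here is a belief vector rather than predictions of a core set of tests, but it obeys the same linear update and normalization.

```python
import numpy as np

def psr_update(q, M_ao, m_inf):
    """One step of Equation (14).

    Returns the updated state and Pr[o | h, do(a)] (the normalizing constant).
    """
    unnorm = M_ao @ q
    prob = m_inf @ unnorm
    return unnorm / prob, prob

# Hypothetical 2-state HMM with one action (illustrative numbers).
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])          # T[j, i] = Pr[next state j | state i]
O = np.array([[0.7, 0.1],
              [0.3, 0.9]])          # O[o, i] = Pr[observation o | state i]
M = [T @ np.diag(O[o]) for o in range(2)]   # one operator per observation
m_inf = np.ones(2)                  # normalizer: m_inf^T q = 1 for every state q
q = np.array([0.5, 0.5])            # initial state

q0, p0 = psr_update(q, M[0], m_inf)  # condition on observing o = 0
q1, p1 = psr_update(q, M[1], m_inf)  # condition on observing o = 1
```

The two normalizing constants are the one-step observation probabilities, so they sum to one, and each updated state again satisfies the normalization condition.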

In addition to the above PSR parameters, we need a few additional definitions for reinforcement
learning: a reward function $\mathcal{R}(h) = \eta^\top Q(h)$ mapping predictive states to immediate rewards, a
discount factor $\gamma \in [0, 1]$ which weights the importance of future rewards vs. present ones, and a
policy $\pi(Q(h))$ mapping from predictive states to actions. (Specifying a reward in terms of the core
test predictions $Q(h)$ is fully general: e.g., if we want to add a unit reward for some test $\tau \notin Q$, we
can instead equivalently set $\eta := \eta + r_\tau$, where $r_\tau$ is defined (as above) so that $\tau(h) = r_\tau^\top Q(h)$.)

Instead of ordinary PSRs, we will work with transformed PSRs (TPSRs) [20, 21]. TPSRs are a
generalization of regular PSRs: a TPSR maintains a small number of sufficient statistics which are
linear combinations of a (potentially very large) set of test probabilities. That is, a TPSR maintains
a small number of feature predictions instead of test predictions. TPSRs have exactly the same
predictive abilities as regular PSRs, but are invariant under similarity transforms: given an invertible
matrix $S$, we can transform $m_1 \to S m_1$, $m_\infty^\top \to m_\infty^\top S^{-1}$, and $M_{ao} \to S M_{ao} S^{-1}$ without changing
the corresponding dynamical system, since pairs $S^{-1} S$ cancel in Eq. 14. The main benefit of TPSRs
over regular PSRs is that, given any core set of tests, low dimensional parameters can be found
using spectral matrix decomposition and regression instead of combinatorial search. In this respect,
TPSRs are closely related to the transformed representations of LDSs and HMMs found by subspace
identification [28, 29, 27, 30].
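The invariance under similarity transforms can be checked numerically. The sketch below reuses a hypothetical two-state, single-action model in observable-operator form (an illustrative assumption) and compares the probability of an observation sequence, computed as $m_\infty^\top M_{o_n} \cdots M_{o_1} m_1$, before and after transforming the parameters by an arbitrary invertible matrix $S$.

```python
import itertools
import numpy as np

def sequence_probability(m1, m_inf, M, seq):
    """Probability of an observation sequence: m_inf^T M[o_n] ... M[o_1] m1."""
    q = m1
    for o in seq:
        q = M[o] @ q
    return m_inf @ q

def similarity_transform(m1, m_inf, M, S):
    """Apply m1 -> S m1, m_inf^T -> m_inf^T S^{-1}, M_o -> S M_o S^{-1}."""
    S_inv = np.linalg.inv(S)
    return S @ m1, S_inv.T @ m_inf, [S @ Mo @ S_inv for Mo in M]

# Hypothetical 2-state model with one action (illustrative numbers).
T = np.array([[0.9, 0.2], [0.1, 0.8]])
O = np.array([[0.7, 0.1], [0.3, 0.9]])
M = [T @ np.diag(O[o]) for o in range(2)]
m1, m_inf = np.array([0.5, 0.5]), np.ones(2)

S = np.array([[2.0, 1.0], [0.5, 3.0]])          # arbitrary invertible matrix
m1_t, m_inf_t, M_t = similarity_transform(m1, m_inf, M, S)

seqs = list(itertools.product(range(2), repeat=3))
original = np.array([sequence_probability(m1, m_inf, M, s) for s in seqs])
transformed = np.array([sequence_probability(m1_t, m_inf_t, M_t, s) for s in seqs])
```

The transformed parameters are no longer interpretable as probabilities, yet they assign the same probability to every observation sequence, which is the sense in which the dynamical system is unchanged.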



4.1 Learning Transformed PSRs 



Let $Q$ be a minimal core set of tests for a dynamical system, with cardinality $n = |Q|$ equal to the
linear dimension of the system. Then, let $\mathcal{T}$ be a larger core set of tests (not necessarily minimal,
and possibly even with $|\mathcal{T}|$ countably infinite). And, let $\mathcal{H}$ be the set of all possible histories. ($|\mathcal{H}|$ is
finite or countably infinite, depending on whether our system is finite-horizon or infinite-horizon.)

As before, write $\phi^{\mathcal{H}}_t \in \mathbb{R}^{\ell}$ for a vector of features of history at time $t$, and write $\phi^{\mathcal{T}}_t \in \mathbb{R}^{\ell}$ for a vector
of features of the future at time $t$. Since $\mathcal{T}$ is a core set of tests, by definition we can compute any test
prediction $\tau(h)$ as a linear function of $\mathcal{T}(h)$. And, since feature predictions are linear combinations
of test predictions, we can also compute any feature prediction $\phi(h)$ as a linear function of $\mathcal{T}(h)$.
We define the matrix $\Phi^{\mathcal{T}} \in \mathbb{R}^{\ell \times |\mathcal{T}|}$ to embody our predictions of future features: that is, an entry of
$\Phi^{\mathcal{T}}$ is the weight of one of the tests in $\mathcal{T}$ for calculating the prediction of one of the features in $\phi^{\mathcal{T}}$.



Below we define several covariance matrices, Equation 15(a-d), in terms of the observable quantities
$\phi^{\mathcal{T}}_t$, $\phi^{\mathcal{H}}_t$, $a_t$, and $o_t$, and show how these matrices relate to the parameters of the underlying PSR.
These relationships then lead to our learning algorithm, Eq. 17 below.



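First we define $\Sigma_{\mathcal{H},\mathcal{H}}$, the covariance matrix of features of histories, as $\mathbb{E}[\phi^{\mathcal{H}}_t \phi^{\mathcal{H}\top}_t \mid h_t \sim \omega]$. Given
$k$ samples, we can approximate this covariance:

$$\widehat{\Sigma}_{\mathcal{H},\mathcal{H}} = \frac{1}{k} \sum_{t=1}^{k} \phi^{\mathcal{H}}_t \phi^{\mathcal{H}\top}_t \qquad (15a)$$

As $k \to \infty$, the empirical covariance $\widehat{\Sigma}_{\mathcal{H},\mathcal{H}}$ converges to the true covariance $\Sigma_{\mathcal{H},\mathcal{H}}$ with probability 1.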

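Next we define $\Sigma_{\mathcal{S},\mathcal{H}}$, the cross covariance of states and features of histories. Writing $s_t = \mathcal{Q}(h_t)$
for the (unobserved) state at time $t$, let

$$\Sigma_{\mathcal{S},\mathcal{H}} = \mathbb{E}\left[ \frac{1}{k} \sum_{t=1}^{k} s_t {\phi^{\mathcal{H}}_t}^{\top} \,\middle|\, h_t \sim \omega\ (\forall t) \right]$$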



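We cannot directly estimate $\Sigma_{\mathcal{S},\mathcal{H}}$ from data, but this matrix will appear as a factor in several of the
matrices that we define below.

Next we define $\Sigma_{\mathcal{T},\mathcal{H}}$, the cross covariance matrix of the features of tests and histories: $\Sigma_{\mathcal{T},\mathcal{H}} \equiv
\mathbb{E}[\phi^{\mathcal{T}}_t {\phi^{\mathcal{H}}_t}^{\top} \mid h_t \sim \omega, do(\zeta)]$. The true covariance is the expectation of the sample covariance $\widehat{\Sigma}_{\mathcal{T},\mathcal{H}}$:

$$
\begin{aligned}
\widehat{\Sigma}_{\mathcal{T},\mathcal{H}} &\equiv \frac{1}{k} \sum_{t=1}^{k} \phi^{\mathcal{T}}_t {\phi^{\mathcal{H}}_t}^{\top} \\
\Sigma_{\mathcal{T},\mathcal{H}} &\equiv \mathbb{E}\left[ \widehat{\Sigma}_{\mathcal{T},\mathcal{H}} \,\middle|\, h_t \sim \omega\ (\forall t),\ do(\zeta)\ (\forall t) \right] \\
&= \mathbb{E}\left[ \frac{1}{k} \sum_{t=1}^{k} \phi^{\mathcal{T}}_t {\phi^{\mathcal{H}}_t}^{\top} \,\middle|\, h_t \sim \omega\ (\forall t),\ do(\zeta)\ (\forall t) \right] \\
&= \mathbb{E}\left[ \frac{1}{k} \sum_{t=1}^{k} \mathbb{E}\left[ \phi^{\mathcal{T}}_t \mid h_t, do(\zeta) \right] {\phi^{\mathcal{H}}_t}^{\top} \,\middle|\, h_t \sim \omega\ (\forall t) \right] \\
&= \mathbb{E}\left[ \frac{1}{k} \sum_{t=1}^{k} \sum_{\tau \in \mathcal{T}} \Phi^{\mathcal{T}}_{\tau}\, \tau(h_t)\, {\phi^{\mathcal{H}}_t}^{\top} \,\middle|\, h_t \sim \omega\ (\forall t) \right] \\
&= \mathbb{E}\left[ \frac{1}{k} \sum_{t=1}^{k} \sum_{\tau \in \mathcal{T}} \Phi^{\mathcal{T}}_{\tau}\, r_{\tau}^{\top} \mathcal{Q}(h_t)\, {\phi^{\mathcal{H}}_t}^{\top} \,\middle|\, h_t \sim \omega\ (\forall t) \right] \\
&= \Phi^{\mathcal{T}} R\ \mathbb{E}\left[ \frac{1}{k} \sum_{t=1}^{k} \mathcal{Q}(h_t)\, {\phi^{\mathcal{H}}_t}^{\top} \,\middle|\, h_t \sim \omega\ (\forall t) \right] \\
&= \Phi^{\mathcal{T}} R\, \Sigma_{\mathcal{S},\mathcal{H}}
\end{aligned}
\tag{15b}
$$

where $\Phi^{\mathcal{T}}_{\tau}$ is the column of $\Phi^{\mathcal{T}}$ for test $\tau$, the vector $r_\tau$ is the linear function that specifies the probability of the test $\tau$ given the
probabilities of tests in the core set $\mathcal{Q}$, and the matrix $R$ has all of the $r_\tau$ vectors as rows.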

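The above derivation shows that, because of our assumptions about the linear dimension of the
system, the matrix $\Sigma_{\mathcal{T},\mathcal{H}}$ has factors $R \in \mathbb{R}^{|\mathcal{T}| \times n}$ and $\Sigma_{\mathcal{S},\mathcal{H}} \in \mathbb{R}^{n \times \ell}$. Therefore, the rank of $\Sigma_{\mathcal{T},\mathcal{H}}$
is no more than $n$, the linear dimension of the system. We can also see that, since the size of $\Sigma_{\mathcal{T},\mathcal{H}}$
is fixed but the number of samples $k$ is increasing, the empirical covariance $\widehat{\Sigma}_{\mathcal{T},\mathcal{H}}$ converges to the
true covariance $\Sigma_{\mathcal{T},\mathcal{H}}$ with probability 1.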
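The rank bound follows from the factorization alone, which a short numerical check makes concrete. This is a minimal sketch with random stand-ins for $\Phi^{\mathcal{T}}$, $R$, and $\Sigma_{\mathcal{S},\mathcal{H}}$; the sizes are hypothetical and the explicit rank tolerance is an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, ell, n_tests = 3, 8, 12  # hypothetical: linear dimension, features, core tests

Phi = rng.normal(size=(ell, n_tests))   # stand-in for Phi^T (ell x |T|)
R = rng.normal(size=(n_tests, n))       # stand-in for R (|T| x n)
S_SH = rng.normal(size=(n, ell))        # stand-in for Sigma_{S,H} (n x ell)

# Factorization of Eq. 15b: Sigma_{T,H} = Phi^T R Sigma_{S,H}.
S_TH = Phi @ R @ S_SH

# An explicit tolerance avoids counting round-off as extra rank.
rank = np.linalg.matrix_rank(S_TH, tol=1e-8)
singular_values = np.linalg.svd(S_TH, compute_uv=False)
```

Although `S_TH` is an 8 x 8 matrix here, only the first `n` singular values are numerically nonzero.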

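Next we define $\Sigma_{\mathcal{H},ao,\mathcal{H}}$, a set of matrices, one for each action-observation pair, that represent the
covariance between features of history before and after taking action $a$ and observing $o$. In the
following, $\mathbb{I}_t(o)$ is an indicator variable for whether we see observation $o$ at step $t$.

$$
\begin{aligned}
\widehat{\Sigma}_{\mathcal{H},ao,\mathcal{H}} &\equiv \frac{1}{k} \sum_{t=1}^{k} \phi^{\mathcal{H}}_{t+1}\, \mathbb{I}_t(o)\, {\phi^{\mathcal{H}}_t}^{\top} \\
\Sigma_{\mathcal{H},ao,\mathcal{H}} &\equiv \mathbb{E}\left[ \widehat{\Sigma}_{\mathcal{H},ao,\mathcal{H}} \,\middle|\, h_t \sim \omega\ (\forall t),\ do(a)\ (\forall t) \right] \\
&= \mathbb{E}\left[ \frac{1}{k} \sum_{t=1}^{k} \phi^{\mathcal{H}}_{t+1}\, \mathbb{I}_t(o)\, {\phi^{\mathcal{H}}_t}^{\top} \,\middle|\, h_t \sim \omega\ (\forall t),\ do(a)\ (\forall t) \right]
\end{aligned}
\tag{15c}
$$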



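Since the dimensions of each $\widehat{\Sigma}_{\mathcal{H},ao,\mathcal{H}}$ are fixed, as $k \to \infty$ these empirical covariances converge to
the true covariances $\Sigma_{\mathcal{H},ao,\mathcal{H}}$ with probability 1.

Finally we define $\Sigma_{\mathcal{R},\mathcal{H}} \equiv \mathbb{E}[R_t {\phi^{\mathcal{H}}_t}^{\top} \mid h_t \sim \omega]$, and approximate the covariance (in this case a
vector) of reward and features of history:

$$
\begin{aligned}
\widehat{\Sigma}_{\mathcal{R},\mathcal{H}} &\equiv \frac{1}{k} \sum_{t=1}^{k} R_t {\phi^{\mathcal{H}}_t}^{\top} \\
\Sigma_{\mathcal{R},\mathcal{H}} &\equiv \mathbb{E}\left[ \widehat{\Sigma}_{\mathcal{R},\mathcal{H}} \,\middle|\, h_t \sim \omega\ (\forall t) \right] \\
&= \mathbb{E}\left[ \frac{1}{k} \sum_{t=1}^{k} R_t {\phi^{\mathcal{H}}_t}^{\top} \,\middle|\, h_t \sim \omega\ (\forall t) \right] \\
&= \mathbb{E}\left[ \frac{1}{k} \sum_{t=1}^{k} \eta^{\top} \mathcal{Q}(h_t)\, {\phi^{\mathcal{H}}_t}^{\top} \,\middle|\, h_t \sim \omega\ (\forall t) \right] \\
&= \eta^{\top}\, \mathbb{E}\left[ \frac{1}{k} \sum_{t=1}^{k} \mathcal{Q}(h_t)\, {\phi^{\mathcal{H}}_t}^{\top} \,\middle|\, h_t \sim \omega\ (\forall t) \right] \\
&= \eta^{\top} \Sigma_{\mathcal{S},\mathcal{H}}
\end{aligned}
\tag{15d}
$$

Again, as $k \to \infty$, $\widehat{\Sigma}_{\mathcal{R},\mathcal{H}}$ converges to $\Sigma_{\mathcal{R},\mathcal{H}}$ with probability 1.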
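The four sample covariances of Eq. 15 are simple averages of outer products. The sketch below shows one way to compute them, assuming data gathered under a single fixed policy (so that matrices are indexed by observation only) and features stored one column per time step. The function name `empirical_covariances` and the random inputs are hypothetical.

```python
import numpy as np

def empirical_covariances(phi_H, phi_T, phi_H_next, obs, rewards, n_obs):
    """Sample covariances of Eq. 15 for data from one fixed policy.

    phi_H, phi_H_next: (l, k) history features at steps t and t+1
    phi_T:             (l_T, k) test (future) features
    obs:               (k,) integer observation index seen at each step
    rewards:           (k,) rewards
    """
    k = phi_H.shape[1]
    S_HH = phi_H @ phi_H.T / k                      # Eq. 15a
    S_TH = phi_T @ phi_H.T / k                      # Eq. 15b
    # Eq. 15c: one matrix per observation; the indicator masks columns.
    S_HoH = [(phi_H_next * (obs == o)) @ phi_H.T / k for o in range(n_obs)]
    S_RH = rewards @ phi_H.T / k                    # Eq. 15d (a vector)
    return S_HH, S_TH, S_HoH, S_RH

# Hypothetical random data, only to exercise the shapes.
rng = np.random.default_rng(1)
k, l, l_T, n_obs = 500, 4, 6, 2
phi_H = rng.normal(size=(l, k))
phi_T = rng.normal(size=(l_T, k))
phi_H_next = np.roll(phi_H, -1, axis=1)
obs = rng.integers(0, n_obs, size=k)
rewards = rng.normal(size=k)

S_HH, S_TH, S_HoH, S_RH = empirical_covariances(
    phi_H, phi_T, phi_H_next, obs, rewards, n_obs)
```

Because exactly one observation occurs at each step, summing the per-observation matrices recovers the plain covariance between successive history features.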

We now wish to use the above-defined matrices to learn a TPSR from data. To do so we need to
make a somewhat restrictive assumption: we assume that our features of history are rich enough to
determine the state of the system, i.e., the regression from $\phi^{\mathcal{H}}$ to $s$ is exact: $s_t = \Sigma_{\mathcal{S},\mathcal{H}} \Sigma_{\mathcal{H},\mathcal{H}}^{-1} \phi^{\mathcal{H}}_t$.
We discuss how to relax this assumption below in Section 4.3. We also need a matrix $U$ such that
$U^{\top} \Phi^{\mathcal{T}} R$ is invertible; with probability 1 a random matrix satisfies this condition, but as we will see
below, it is useful to choose $U$ via SVD of a scaled version of $\Sigma_{\mathcal{T},\mathcal{H}}$ as described in Sec. 3.2.
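Using our assumptions we can show a useful identity for $\Sigma_{\mathcal{H},ao,\mathcal{H}}$:

$$
\begin{aligned}
\Sigma_{\mathcal{S},\mathcal{H}} (\Sigma_{\mathcal{H},\mathcal{H}})^{-1} \Sigma_{\mathcal{H},ao,\mathcal{H}}
&= \mathbb{E}\left[ \frac{1}{k} \sum_{t=1}^{k} \Sigma_{\mathcal{S},\mathcal{H}} (\Sigma_{\mathcal{H},\mathcal{H}})^{-1} \phi^{\mathcal{H}}_{t+1}\, \mathbb{I}_t(o)\, {\phi^{\mathcal{H}}_t}^{\top} \,\middle|\, h_t \sim \omega\ (\forall t),\ do(a)\ (\forall t) \right] \\
&= \mathbb{E}\left[ \frac{1}{k} \sum_{t=1}^{k} s_{t+1}\, \mathbb{I}_t(o)\, {\phi^{\mathcal{H}}_t}^{\top} \,\middle|\, h_t \sim \omega\ (\forall t),\ do(a)\ (\forall t) \right] \\
&= \mathbb{E}\left[ \frac{1}{k} \sum_{t=1}^{k} M_{ao}\, s_t\, {\phi^{\mathcal{H}}_t}^{\top} \,\middle|\, h_t \sim \omega\ (\forall t) \right] \\
&= M_{ao} \Sigma_{\mathcal{S},\mathcal{H}}
\end{aligned}
\tag{16}
$$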



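This identity is at the heart of our learning algorithm: it shows that $\Sigma_{\mathcal{H},ao,\mathcal{H}}$ contains a hidden copy
of $M_{ao}$, the main TPSR parameter that we need to learn. We would like to recover $M_{ao}$ via Eq. 16,
$M_{ao} = \Sigma_{\mathcal{S},\mathcal{H}} (\Sigma_{\mathcal{H},\mathcal{H}})^{-1} \Sigma_{\mathcal{H},ao,\mathcal{H}} \Sigma_{\mathcal{S},\mathcal{H}}^{\dagger}$; but of course we do not know $\Sigma_{\mathcal{S},\mathcal{H}}$. Fortunately, though, it turns
out that we can use $U^{\top} \Sigma_{\mathcal{T},\mathcal{H}}$ as a stand-in, as described below, since this matrix differs from $\Sigma_{\mathcal{S},\mathcal{H}}$
only by an invertible transform (Eq. 15b).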



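We now show how to recover a TPSR from the matrices $\Sigma_{\mathcal{T},\mathcal{H}}$, $\Sigma_{\mathcal{H},\mathcal{H}}$, $\Sigma_{\mathcal{R},\mathcal{H}}$, $\Sigma_{\mathcal{H},ao,\mathcal{H}}$, and $U$.
Since a TPSR's predictions are invariant to a similarity transform of its parameters, our algorithm
only recovers the TPSR parameters to within a similarity transform.

$$
b_t \equiv U^{\top} \Sigma_{\mathcal{T},\mathcal{H}} (\Sigma_{\mathcal{H},\mathcal{H}})^{-1} \phi^{\mathcal{H}}_t = (U^{\top} \Phi^{\mathcal{T}} R)\, s_t
\tag{17a}
$$

$$
\begin{aligned}
B_{ao} &\equiv U^{\top} \Sigma_{\mathcal{T},\mathcal{H}} (\Sigma_{\mathcal{H},\mathcal{H}})^{-1} \Sigma_{\mathcal{H},ao,\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} \\
&= U^{\top} \Phi^{\mathcal{T}} R\, \Sigma_{\mathcal{S},\mathcal{H}} (\Sigma_{\mathcal{H},\mathcal{H}})^{-1} \Sigma_{\mathcal{H},ao,\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} \\
&= (U^{\top} \Phi^{\mathcal{T}} R)\, M_{ao} \Sigma_{\mathcal{S},\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} \\
&= (U^{\top} \Phi^{\mathcal{T}} R)\, M_{ao} (U^{\top} \Phi^{\mathcal{T}} R)^{-1} (U^{\top} \Phi^{\mathcal{T}} R)\, \Sigma_{\mathcal{S},\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} \\
&= (U^{\top} \Phi^{\mathcal{T}} R)\, M_{ao} (U^{\top} \Phi^{\mathcal{T}} R)^{-1}
\end{aligned}
\tag{17b}
$$

$$
\begin{aligned}
b_{\eta}^{\top} &\equiv \Sigma_{\mathcal{R},\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} \\
&= \eta^{\top} \Sigma_{\mathcal{S},\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} \\
&= \eta^{\top} (U^{\top} \Phi^{\mathcal{T}} R)^{-1} (U^{\top} \Phi^{\mathcal{T}} R)\, \Sigma_{\mathcal{S},\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} \\
&= \eta^{\top} (U^{\top} \Phi^{\mathcal{T}} R)^{-1}
\end{aligned}
\tag{17c}
$$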



Our PSR learning algorithm is simple: replace each true covariance matrix in Eq. 17 by its
empirical estimate. Since the empirical estimates converge to their true values with probability 1 as
the sample size increases, our learning algorithm is statistically consistent.
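To see Eq. 17 at work, the following sketch builds exact (noise-free) covariance matrices from hypothetical ground-truth parameters and checks that the recovered $B_{ao}$ and $b_\eta$ equal the true parameters up to the similarity transform $U^{\top}\Phi^{\mathcal{T}}R$. All sizes and random matrices are assumptions of the example; `A` stands in for the product $\Phi^{\mathcal{T}}R$, and $\Sigma_{\mathcal{H},ao,\mathcal{H}}$ is constructed so that the identity in Eq. 16 holds exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
n, ell, ell_T = 3, 6, 7  # hypothetical: state, history-feature, test-feature sizes

# Hypothetical ground truth.
A = rng.normal(size=(ell_T, n))        # stands in for Phi^T R
S_SH = rng.normal(size=(n, ell))       # Sigma_{S,H}, full row rank
G = rng.normal(size=(ell, ell))
S_HH = G @ G.T + np.eye(ell)           # Sigma_{H,H}, positive definite
M_ao = rng.normal(size=(n, n))         # one transition operator
eta = rng.normal(size=n)               # reward weights

S_TH = A @ S_SH                        # Eq. 15b
S_RH = eta @ S_SH                      # Eq. 15d
# Constructed so that Sigma_SH Sigma_HH^{-1} Sigma_HaoH = M_ao Sigma_SH (Eq. 16).
S_HaoH = S_HH @ np.linalg.pinv(S_SH) @ M_ao @ S_SH

# U: top-n left singular vectors of Sigma_{T,H}.
U = np.linalg.svd(S_TH)[0][:, :n]
UT_S_TH = U.T @ S_TH
pinv_UT = np.linalg.pinv(UT_S_TH)

B_ao = UT_S_TH @ np.linalg.solve(S_HH, S_HaoH) @ pinv_UT   # Eq. 17b
b_eta = S_RH @ pinv_UT                                     # Eq. 17c

# Expected similarity-transformed parameters.
P = U.T @ A
P_inv = np.linalg.inv(P)
B_expected = P @ M_ao @ P_inv
b_eta_expected = eta @ P_inv
```

In practice the true covariances are replaced by their sample estimates, so the equalities hold only in the limit of infinite data.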



4.2 Predictive State Temporal Difference Learning (Revisited) 



Finally, we are ready to show that the model-free PSTD learning algorithm introduced in Section 3.3
is equivalent to a model-based algorithm built around PSR learning. For a fixed policy $\pi$, a TPSR's
value function is a linear function of state, $J^{\pi}(s) = w^{\top} b$, and is the solution of the TPSR Bellman
equation [31]: for all $b$, $w^{\top} b = b_{\eta}^{\top} b + \gamma \sum_{o \in \mathcal{O}} w^{\top} B_{\pi o} b$, or equivalently,

$$w^{\top} = b_{\eta}^{\top} + \gamma \sum_{o \in \mathcal{O}} w^{\top} B_{\pi o}$$
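If we substitute in our learned PSR parameters from Equations 17(a-c), we get

$$
\begin{aligned}
w^{\top} &= \Sigma_{\mathcal{R},\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} + \gamma \sum_{o \in \mathcal{O}} w^{\top} U^{\top} \Sigma_{\mathcal{T},\mathcal{H}} (\Sigma_{\mathcal{H},\mathcal{H}})^{-1} \Sigma_{\mathcal{H},\pi o,\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} \\
w^{\top} U^{\top} \Sigma_{\mathcal{T},\mathcal{H}} &= \Sigma_{\mathcal{R},\mathcal{H}} + \gamma\, w^{\top} U^{\top} \Sigma_{\mathcal{T},\mathcal{H}} (\Sigma_{\mathcal{H},\mathcal{H}})^{-1} \Sigma_{\mathcal{H}^{+},\mathcal{H}}
\end{aligned}
$$

since, by comparing Eqs. 15c and 12, we can see that $\sum_{o \in \mathcal{O}} \Sigma_{\mathcal{H},\pi o,\mathcal{H}} = \Sigma_{\mathcal{H}^{+},\mathcal{H}}$. Now, suppose
that we define $\hat{U}$ and $\hat{V}$ as in Eq. 8 (see also Appendix A), and let $U = \hat{U}$ as suggested above in Sec. 4.1.
Then $U^{\top} \Sigma_{\mathcal{T},\mathcal{H}} = \hat{V} \Sigma_{\mathcal{H},\mathcal{H}}$, and

$$
\begin{aligned}
w^{\top} \hat{V} \Sigma_{\mathcal{H},\mathcal{H}} &= \Sigma_{\mathcal{R},\mathcal{H}} + \gamma\, w^{\top} \hat{V} \Sigma_{\mathcal{H}^{+},\mathcal{H}} \\
w^{\top} &= \Sigma_{\mathcal{R},\mathcal{H}} \left( \hat{V} \Sigma_{\mathcal{H},\mathcal{H}} - \gamma \hat{V} \Sigma_{\mathcal{H}^{+},\mathcal{H}} \right)^{\dagger}
\end{aligned}
\tag{18}
$$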



Eq. 18 is exactly the PSTD algorithm (Eq. 13). So, we have shown that, if we learn a PSR by the
subspace identification algorithm of Sec. 4.1 and then compute its value function via the Bellman
equation, we get the exact same answer as if we had directly learned the value function via the
model-free PSTD method. In addition to adding to our understanding of both methods, an important
corollary of this result is that PSTD is a statistically consistent algorithm for PSR value function
approximation; to our knowledge, this is the first such result for a TD method.
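The equivalence can be illustrated numerically in the exact (infinite-data) setting. The sketch below computes the value-function weights twice: once with the model-free formula of Eq. 18, and once by forming TPSR parameters as in Eq. 17 and solving the Bellman equation. Everything here is a hypothetical construction: `A` stands in for $\Phi^{\mathcal{T}}R$, `M` for the summed operator $\sum_{o} M_{\pi o}$, and `S_HplusH` for $\Sigma_{\mathcal{H}^{+},\mathcal{H}}$, built so that the identity of Eq. 16 holds exactly.

```python
import numpy as np

rng = np.random.default_rng(4)
n, ell, ell_T, gamma = 3, 6, 7, 0.9  # hypothetical sizes and discount

A = rng.normal(size=(ell_T, n))        # stands in for Phi^T R
S_SH = rng.normal(size=(n, ell))       # Sigma_{S,H}
G = rng.normal(size=(ell, ell))
S_HH = G @ G.T + np.eye(ell)           # Sigma_{H,H}
M = rng.normal(size=(n, n))
M *= 0.5 / np.linalg.norm(M, 2)        # spectral norm 0.5, so I - gamma*M is invertible
eta = rng.normal(size=n)

S_TH = A @ S_SH
S_RH = eta @ S_SH
S_HplusH = S_HH @ np.linalg.pinv(S_SH) @ M @ S_SH

U = np.linalg.svd(S_TH)[0][:, :n]

# Model-free PSTD (Eq. 18).
V = U.T @ S_TH @ np.linalg.inv(S_HH)
w_pstd = S_RH @ np.linalg.pinv(V @ S_HH - gamma * V @ S_HplusH)

# Model-based route: TPSR parameters (Eq. 17), then the Bellman equation.
pinv_UT = np.linalg.pinv(U.T @ S_TH)
b_eta = S_RH @ pinv_UT
B = U.T @ S_TH @ np.linalg.solve(S_HH, S_HplusH) @ pinv_UT
w_model = b_eta @ np.linalg.inv(np.eye(n) - gamma * B)
```

Both routes return the same weight vector up to floating-point error, mirroring the argument above.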

PSTD learning is related to value-directed compression of POMDPs [?]. If we learn a TPSR from
data generated by a POMDP, then the TPSR state is exactly a linear compression of the POMDP
state [?, 20]. The compression can be exact or approximate, depending on whether we include
enough features of the future and whether we keep all or only some nonzero singular values in our
bottleneck. If we include only reward as a feature of the future, we get a value-directed compression
in the sense of Poupart and Boutilier [?]. If desired, we can tune the degree of value-directedness
of our compression by scaling the relative variance of our features: the higher the variance of the
reward feature compared to other features, the more value-directed the resulting compression will
be. Our work significantly diverges from previous work on POMDP compression in one important
respect: prior work assumes access to the true POMDP model, while we make no such assumption,
and learn a compressed representation directly from data.



4.3 Insights from Subspace Identification 

The close connection to subspace identification for PSRs provides additional insight into the
temporal difference learning procedure. In Equation 17 we made the assumption that the features
of history are rich enough to completely determine the state of the dynamical system. In fact,
using theory developed in [?], it is possible to relax this assumption and instead assume that
state is merely correlated with features of history. In this case, we need to introduce a new set
of covariance matrices $\Sigma_{\mathcal{T},ao,\mathcal{H}} \equiv \mathbb{E}[\phi^{\mathcal{T}}_{t+1}\, \mathbb{I}_t(o)\, {\phi^{\mathcal{H}}_t}^{\top} \mid h_t \sim \omega, do(a, \zeta)]$, one for each
action-observation pair, that represent the covariance between features of history before and features
of tests after taking action $a$ and observing $o$. We can then estimate the TPSR transition
matrices as $B_{ao} = U^{\top} \Sigma_{\mathcal{T},ao,\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger}$ (see [?] for proof details). The value function
parameter $w$ can be estimated as

$$
w^{\top} = \Sigma_{\mathcal{R},\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} \Big( I - \gamma \sum_{o \in \mathcal{O}} U^{\top} \Sigma_{\mathcal{T},ao,\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} \Big)^{-1}
= \Sigma_{\mathcal{R},\mathcal{H}} \Big( U^{\top} \Sigma_{\mathcal{T},\mathcal{H}} - \gamma \sum_{o \in \mathcal{O}} U^{\top} \Sigma_{\mathcal{T},ao,\mathcal{H}} \Big)^{\dagger}
$$

(the proof is similar to Equation 18). Since we no longer
assume that state is completely specified by features of history, we can no longer apply the learned
value function to $U^{\top} \Sigma_{\mathcal{T},\mathcal{H}} (\Sigma_{\mathcal{H},\mathcal{H}})^{-1} \phi^{\mathcal{H}}_t$ at each time $t$. Instead we need to learn a full PSR model and
filter with the model to estimate state. Details on this procedure can be found in [?].



5 Experimental Results 

We designed several experiments to evaluate the properties of the PSTD learning algorithm. In 
the first set of experiments we look at the comparative merits of PSTD with respect to LSTD and 
LARS-TD when applied to the problem of estimating the value function of a reduced-rank POMDP. 
In the second set of experiments, we apply PSTD to a benchmark optimal stopping problem (pricing 
a fictitious financial derivative), and show that PSTD outperforms competing approaches. 

5.1 Estimating the Value Function of a RR-POMDP 

We evaluate the PSTD learning algorithm on a synthetic example derived from [32]. The problem is
to find the value function of a policy in a partially observable Markov decision process (POMDP).
The POMDP has 4 latent states, but the policy's transition matrix is low rank: the resulting belief
distributions can be represented in a 3-dimensional subspace of the original belief simplex. A reward
of 1 is given in the first and third latent state and a reward of 0 in the other two latent states (see
Appendix, Section B). The system emits 2 possible observations, conflating information about the
latent states.

We perform 3 experiments, comparing the performance of LSTD, LARS-TD, PSTD, and PSTD as
formulated in Section 4.3 (which we call PSTD2) when different sets of features are used. In each
case we compare the value function estimated by each algorithm to the true value function computed
by $J^{\pi} = R\,(I - \gamma T^{\pi})^{-1}$.
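The closed form for the true value function over latent states is a single linear solve. The sketch below uses a hypothetical rank-3 transition matrix (not the one from the appendix), the reward pattern described in the text, and the convention that `T[i, j]` is the probability of moving to state `i` from state `j`, so that the value function is a row vector.

```python
import numpy as np

rng = np.random.default_rng(5)
gamma = 0.9
R = np.array([1.0, 0.0, 1.0, 0.0])  # reward 1 in the first and third latent states

# Hypothetical rank-3 transition matrix: product of two column-stochastic factors.
left = rng.random((4, 3))
left /= left.sum(axis=0)
right = rng.random((3, 4))
right /= right.sum(axis=0)
T = left @ right                     # columns sum to 1, rank at most 3

# Closed form J = R (I - gamma T)^{-1}.
J = R @ np.linalg.inv(np.eye(4) - gamma * T)

# The same J satisfies the Bellman recursion J = R + gamma J T.
bellman_residual = J - (R + gamma * J @ T)
```

The residual vanishes up to round-off, confirming that the closed form is the fixed point of the Bellman recursion.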

In the first experiment we execute the policy $\pi$ for 1000 time steps. We split the data into overlapping
histories and tests of length 5, and sample 10 of these histories and tests to serve as centers for
Gaussian radial basis functions. We then evaluate each basis function at every remaining sample.
Then, using these features, we learned the value function using LSTD, LARS-TD, PSTD with linear
dimension 3, and PSTD2 with linear dimension 3 (Figure 1(A)).$^4$ In this experiment, PSTD and
PSTD2 both had lower mean squared error than the other approaches. For the second experiment,
we added 490 random features to the 10 good features and then attempted to learn the value function
with each of the algorithms (Figure 1(B)). In this case, LSTD and PSTD both had difficulty fitting
the value function due to the large number of irrelevant features in both tests and histories and the
relatively small amount of training data. LARS-TD, designed for precisely this scenario, was able
to select the 10 relevant features and estimate the value function better by a substantial margin.
Surprisingly, in this experiment PSTD2 not only outperformed PSTD but bested even LARS-TD.
For the third experiment, we increased the number of sampled features from 10 to 500. In this case,
each feature was somewhat relevant, but the number of features was relatively large compared to the
amount of training data. This situation occurs frequently in practice: it is often easy to find a large
number of features that are at least somewhat related to state. PSTD and PSTD2 both outperform
LARS-TD, and each of these subspace and subset selection methods outperforms LSTD by a large
margin by efficiently estimating the value function (Figure 1(C)).

$^4$ Comparing LSTD and PSTD is straightforward; the two methods differ only by the compression operator $V$.

Figure 1: Experimental Results. Error bars indicate standard error. (A) Estimating the value function
with a small number of informative features. PSTD and PSTD2 both do well. (B) Estimating the
value function with a small set of informative features and a large set of random features. LARS-TD
is designed for this scenario and dramatically outperforms PSTD and LSTD; however, it does not
outperform PSTD2. (C) Estimating the value function with a large set of semi-informative features.
PSTD is able to determine a small set of compressed features that retain the maximal amount of
information about the value function, outperforming LSTD by a very large margin. (D) Pricing a
high-dimensional derivative via policy iteration. The y-axis is expected reward for the current policy
at each iteration. The optimal threshold strategy (sell if price is above a threshold [33]) is in black,
LSTD (16 canonical features) is in blue, LSTD (on the full 220 features) is cyan, LARS-TD (feature
selection from set of 220) is in green, and PSTD (16 dimensions, compressing 220 features (16 +
204)) is in red.
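The feature construction used in the first experiment (overlapping windows, a few sampled windows as centers, Gaussian radial basis functions evaluated on the rest) can be sketched as follows. This is an illustrative sketch only: the observation stream is random, histories are built from observations alone, and the bandwidth of 1.0 is an assumption, since the text does not specify it. The helper names `windows` and `rbf_features` are hypothetical.

```python
import numpy as np

def windows(seq, length):
    """All overlapping windows of the given length, one per row."""
    seq = np.asarray(seq, dtype=float)
    return np.stack([seq[i:i + length] for i in range(len(seq) - length + 1)])

def rbf_features(points, centers, bandwidth):
    """Gaussian RBF activations, shape (n_points, n_centers)."""
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(6)
obs = rng.integers(0, 2, size=1000)      # hypothetical binary observation stream
hist = windows(obs, 5)                   # overlapping histories of length 5

# Sample 10 windows as centers and evaluate the features on the remaining ones.
center_idx = rng.choice(len(hist), size=10, replace=False)
mask = np.ones(len(hist), dtype=bool)
mask[center_idx] = False
features = rbf_features(hist[mask], hist[center_idx], bandwidth=1.0)
```

The same construction would be applied to tests (windows of future observations) to obtain features of the future.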

5.2 Pricing A High-dimensional Financial Derivative 

Derivatives are financial contracts with payoffs linked to the future prices of basic assets such as 
stocks, bonds and commodities. In some derivatives the contract holder has no choices, but in 
more complex cases, the contract owner must make decisions — e.g., with early exercise the contract 
holder can decide to terminate the contract at any time and receive payments based on prevailing 
market conditions. In these cases, the value of the derivative depends on how the contract holder 
acts. Deciding when to exercise is therefore an optimal stopping problem: at each point in time, 
the contract holder must decide whether to continue holding the contract or exercise. Such stopping 
problems provide an ideal testbed for policy evaluation methods, since we can easily collect a single 
data set which is sufficient to evaluate any policy: we just choose the "continue" action forever. (We 
can then evaluate the "stop" action easily in any of the resulting states, since the immediate reward 
is given by the rules of the contract, and the next state is the terminal state by definition.) 

We consider the financial derivative introduced by Tsitsiklis and Van Roy [33]. The derivative
generates payoffs that are contingent on the prices of a single stock. At the end of a given day, the 
holder may opt to exercise. At exercise the owner receives a payoff equal to the current price of the 
stock divided by the price 100 days beforehand. We can think of this derivative as a "psychic call": 
the owner gets to decide whether s/he would like to have bought an ordinary 100-day European call 
option, at the then-current market price, 100 days ago. 

In our simulation (and unknown to the investor), the underlying stock price follows a geometric
Brownian motion with volatility $\sigma = 0.02$ and continuously compounded short term growth rate
$\rho = 0.0004$. Assuming stock prices fluctuate only on days when the market is open, these parameters
correspond to an annual growth rate of approximately 10%. In more detail, if $w_t$ is a standard Brownian motion,
then the stock price $p_t$ evolves as $dp_t = \rho\, p_t\, dt + \sigma\, p_t\, dw_t$, and we can summarize relevant state
at the end of each day as a vector $x_t \in \mathbb{R}^{100}$, with $x_t = \left( \frac{p_{t-99}}{p_{t-100}}, \frac{p_{t-98}}{p_{t-100}}, \ldots, \frac{p_t}{p_{t-100}} \right)^{\top}$. The $i$th
dimension $x_t(i)$ represents the amount a \$1 investment in a stock at time $t - 100$ would grow to at
time $t - 100 + i$. This process is Markov and ergodic [33, 34]: $x_t$ and $x_{t+100}$ are independent and
identically distributed. The immediate reward for exercising the option is $G(x) = x(100)$, and the
immediate reward for continuing to hold the option is 0. The discount factor $\gamma = e^{-\rho}$ is determined
by the growth rate; this corresponds to assuming that the risk-free interest rate is equal to the stock's
growth rate, meaning that the investor gains nothing in expectation by holding the stock itself.
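The price process and the 100-dimensional state can be simulated in a few lines. The sketch below uses the exact discretization of geometric Brownian motion with a time step of one trading day; the helper names and the figure of 252 trading days per year are assumptions made for illustration.

```python
import numpy as np

sigma, rho = 0.02, 0.0004        # daily volatility and growth rate from the text
gamma = np.exp(-rho)             # discount factor

def simulate_prices(n_days, rng, p0=1.0):
    """Exact daily discretization of dp = rho*p*dt + sigma*p*dw with dt = 1 day."""
    z = rng.standard_normal(n_days)
    log_returns = (rho - 0.5 * sigma ** 2) + sigma * z
    return p0 * np.exp(np.concatenate([[0.0], np.cumsum(log_returns)]))

def state(prices, t):
    """x_t(i) = p_{t-100+i} / p_{t-100} for i = 1..100."""
    return prices[t - 99:t + 1] / prices[t - 100]

def exercise_reward(x):
    """G(x) = x(100): current price divided by the price 100 days earlier."""
    return x[-1]

rng = np.random.default_rng(7)
prices = simulate_prices(1000, rng)   # prices[0..1000]
x = state(prices, 500)

# Annual growth implied by rho, assuming 252 trading days per year.
annual_growth = np.exp(rho * 252) - 1.0
```

With these parameters `annual_growth` comes out close to the roughly 10% quoted above.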

The value of the derivative, if the current state is $x$, is given by $V^{*}(x) = \sup_t \mathbb{E}[\gamma^t G(x_t) \mid x_0 = x]$.
Our goal is to calculate an approximate value function $V(x) = w^{\top} \phi^{\mathcal{H}}(x)$, and then use this value
function to generate a stopping time $\min\{t \mid G(x_t) > V(x_t)\}$. To do so, we sample a sequence
of 1,000,000 states $x_t \in \mathbb{R}^{100}$ and calculate features $\phi^{\mathcal{H}}$ of each state. We then perform policy
iteration on this sample, alternately estimating the value function under a given policy and then
using this value function to define a new greedy policy "stop if $G(x_t) > w^{\top} \phi^{\mathcal{H}}(x_t)$."
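The greedy stopping rule reduces to finding the first time step at which the exercise value exceeds the estimated continuation value. A minimal sketch, assuming the two value sequences have already been computed along one trajectory; the function names and the toy numbers are hypothetical.

```python
import numpy as np

def stopping_time(exercise_values, continuation_values):
    """First t with G(x_t) > V(x_t); None if the rule never fires."""
    hits = np.flatnonzero(
        np.asarray(exercise_values) > np.asarray(continuation_values))
    return int(hits[0]) if hits.size else None

def discounted_payoff(exercise_values, continuation_values, gamma):
    """Discounted reward gamma^t * G(x_t) collected at the stopping time."""
    t = stopping_time(exercise_values, continuation_values)
    return 0.0 if t is None else gamma ** t * exercise_values[t]

# Toy trajectory: exercise values G(x_t) and estimated values V(x_t) = w^T phi(x_t).
G_vals = [0.98, 1.01, 1.05, 1.02]
V_vals = [1.00, 1.02, 1.03, 1.00]

t_stop = stopping_time(G_vals, V_vals)
payoff = discounted_payoff(G_vals, V_vals, gamma=0.5)
```

Averaging such discounted payoffs over many trajectories gives an estimate of the expected reward of the greedy policy.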

Within the above strategy, we have two main choices: which features do we use, and how do we
estimate the value function in terms of these features. For value function estimation, we used LSTD,
LARS-TD, or PSTD. In each case we re-used our 1,000,000-state sample trajectory for all iterations:
we start at the beginning and follow the trajectory as long as the policy chooses the "continue" action,
with reward 0 at each step. When the policy executes the "stop" action, the reward is $G(x)$ and the
next state's features are all 0; we then restart the policy 100 steps in the future, after the process
has fully mixed. For feature selection, we are fortunate: previous researchers have hand-selected a
"good" set of 16 features for this data set through repeated trial and error (see Appendix, Section B
and [33, 34]). We greatly expand this set of features, then use PSTD to synthesize a small set of
high-quality combined features. Specifically, we add the entire 100-step state vector, the squares of the
components of the state vector, and several additional nonlinear features, increasing the total number
of features from 16 to 220. We use histories of length 1, tests of length 5, and (for comparison's
sake) we choose a linear dimension of 16. Tests (but not histories) were value-directed by reducing
the variance of all features except reward by a factor of 100.

Figure 1(D) shows results. We compared PSTD (reducing 220 to 16 features) to LSTD with either
the 16 hand-selected features or the full 220 features, as well as to LARS-TD (220 features) and to
a simple thresholding strategy [33]. In each case we evaluated the final policy on 10,000 new random
trajectories. PSTD outperformed each of its competitors, improving on the next best approach,
LARS-TD, by 1.75 percentage points. In fact, PSTD performs better than the best previously reported
approach [33, 34] by 1.24 percentage points. These improvements correspond to appreciable
fractions of the risk-free interest rate (which is about 4 percentage points over the 100 day window
of the contract), and therefore to significant arbitrage opportunities: an investor who doesn't know
the best strategy will consistently undervalue the security, allowing an informed investor to buy it
for below its expected value.

6 Conclusion 

In this paper, we attack the feature selection problem for temporal difference learning. Although 
well-known temporal difference algorithms such as LSTD can provide asymptotically unbiased es- 
timates of value function parameters in linear architectures, they can have trouble in finite samples: 
if the number of features is large relative to the number of training samples, then they can have 
high variance in their value function estimates. For this reason, in real-world problems, a substantial 
amount of time is spent selecting a small set of features, often by trial and error [33, 34].

To remedy this problem, we present the PSTD algorithm, a new approach to feature selection for 
TD methods, which demonstrates how insights from system identification can benefit reinforcement 
learning. PSTD automatically chooses a small set of features that are relevant for prediction and 
value function approximation. It approaches feature selection from a bottleneck perspective, by 
finding a small set of features that preserves only predictive information. Because of the focus
on predictive information, the PSTD approach is closely connected to PSRs: under appropriate
assumptions, PSTD's compressed set of features is asymptotically equivalent to TPSR state, and
PSTD is a consistent estimator of the PSR value function.

We demonstrate the merits of PSTD compared to two popular alternative algorithms, LARS-TD 
and LSTD, on a synthetic example, and argue that PSTD is most effective when approximating a 
value function from a large number of features, each of which contains at least a little information 
about state. Finally, we apply PSTD to a difficult optimal stopping problem, and demonstrate the 
practical utility of the algorithm by outperforming several alternative approaches and topping the 
best reported previous results.

Appendix



A Determining the Compression Operator 

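We find a compression operator $V$ that optimally predicts test-features through the CCA bottleneck
defined by $U$. Writing $\phi^{\mathcal{T}}_{1:k}$ and $\phi^{\mathcal{H}}_{1:k}$ for the matrices whose columns are the $k$ sampled feature vectors,
the least squares estimate can be found by minimizing the following loss

$$
\mathcal{L}(V) = \frac{1}{k} \left\| \phi^{\mathcal{T}}_{1:k} - U V \phi^{\mathcal{H}}_{1:k} \right\|_F^2,
\qquad
\hat{V} = \arg\min_V \mathcal{L}(V)
$$

where $\| \cdot \|_F$ denotes the Frobenius norm. We can find $\hat{V}$ by taking a derivative of this loss $\mathcal{L}$ with
respect to $V$, setting it to zero, and solving for $V$:

$$
\begin{aligned}
\mathcal{L} &= \frac{1}{k} \operatorname{tr}\left[ (\phi^{\mathcal{T}}_{1:k} - U V \phi^{\mathcal{H}}_{1:k}) (\phi^{\mathcal{T}}_{1:k} - U V \phi^{\mathcal{H}}_{1:k})^{\top} \right] \\
&= \frac{1}{k} \operatorname{tr}\left[ {\phi^{\mathcal{T}}_{1:k}}^{\top} \phi^{\mathcal{T}}_{1:k} - 2\, {\phi^{\mathcal{T}}_{1:k}}^{\top} U V \phi^{\mathcal{H}}_{1:k} + {\phi^{\mathcal{H}}_{1:k}}^{\top} V^{\top} U^{\top} U V \phi^{\mathcal{H}}_{1:k} \right] \\
\Rightarrow\ d\mathcal{L} &= -\frac{2}{k} \operatorname{tr}\left( {\phi^{\mathcal{H}}_{1:k}}^{\top} dV^{\top} U^{\top} \phi^{\mathcal{T}}_{1:k} \right) + \frac{2}{k} \operatorname{tr}\left( {\phi^{\mathcal{H}}_{1:k}}^{\top} dV^{\top} U^{\top} U V \phi^{\mathcal{H}}_{1:k} \right) \\
\Rightarrow\ d\mathcal{L} &= -\frac{2}{k} \operatorname{tr}\left( dV^{\top} U^{\top} \phi^{\mathcal{T}}_{1:k} {\phi^{\mathcal{H}}_{1:k}}^{\top} \right) + \frac{2}{k} \operatorname{tr}\left( dV^{\top} U^{\top} U V \phi^{\mathcal{H}}_{1:k} {\phi^{\mathcal{H}}_{1:k}}^{\top} \right) \\
\Rightarrow\ d\mathcal{L} &= -2 \operatorname{tr}\left( dV^{\top} U^{\top} \widehat{\Sigma}_{\mathcal{T},\mathcal{H}} \right) + 2 \operatorname{tr}\left( dV^{\top} U^{\top} U V \widehat{\Sigma}_{\mathcal{H},\mathcal{H}} \right) \\
\Rightarrow\ \frac{d\mathcal{L}}{dV} &= -2\, U^{\top} \widehat{\Sigma}_{\mathcal{T},\mathcal{H}} + 2\, U^{\top} U V \widehat{\Sigma}_{\mathcal{H},\mathcal{H}} \\
\Rightarrow\ 0 &= -U^{\top} \widehat{\Sigma}_{\mathcal{T},\mathcal{H}} + U^{\top} U V \widehat{\Sigma}_{\mathcal{H},\mathcal{H}} \\
\Rightarrow\ \hat{V} &= (U^{\top} U)^{-1} U^{\top} \widehat{\Sigma}_{\mathcal{T},\mathcal{H}} (\widehat{\Sigma}_{\mathcal{H},\mathcal{H}})^{-1}
\end{aligned}
$$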
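The closed-form solution can be checked against the loss directly. The sketch below draws random feature matrices and a random full-column-rank $U$ (all hypothetical), forms $\hat{V}$ from the sample covariances, and confirms that the stationarity condition holds and that a perturbed $V$ has a larger loss.

```python
import numpy as np

rng = np.random.default_rng(8)
ell, ell_T, d, k = 5, 6, 2, 200   # hypothetical sizes
phi_H = rng.normal(size=(ell, k))       # history features, one column per sample
phi_T = rng.normal(size=(ell_T, k))     # test features, one column per sample
U = np.linalg.qr(rng.normal(size=(ell_T, d)))[0]   # any full-column-rank U works here

S_TH = phi_T @ phi_H.T / k
S_HH = phi_H @ phi_H.T / k

def loss(V):
    """(1/k) * squared Frobenius norm of the prediction residual."""
    return np.linalg.norm(phi_T - U @ V @ phi_H, "fro") ** 2 / k

# Closed-form least squares estimate of the compression operator.
V_hat = np.linalg.inv(U.T @ U) @ U.T @ S_TH @ np.linalg.inv(S_HH)

# Stationarity condition from the derivation: should be (numerically) zero.
gradient = -U.T @ S_TH + U.T @ U @ V_hat @ S_HH
```

When $U$ has orthonormal columns, as it does when taken from an SVD, the factor $(U^{\top}U)^{-1}$ is the identity and the estimate reduces to $U^{\top}\widehat{\Sigma}_{\mathcal{T},\mathcal{H}}(\widehat{\Sigma}_{\mathcal{H},\mathcal{H}})^{-1}$.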



B Experimental Results 

B.1 RR-POMDP

The RR-POMDP parameters are:

[m = 4 hidden states, n = 2 observations, k = 3 transition matrix rank].

$$
T^{\pi} = \begin{bmatrix}
 & \vdots & \\
\cdots & 0.4262 & 0.0465 \\
\cdots & 0.4380 & 0.0959 \\
\cdots & 0.0959 & 0.7840
\end{bmatrix},
\qquad
O = \begin{bmatrix}
1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1
\end{bmatrix}
$$

The discount factor is $\gamma = 0.9$.



B.2 Pricing a financial derivative 

Basis functions. The first 16 are the basis functions suggested by Van Roy; for a full description
and justification see [33, 34]. The first functions consist of a constant, the reward, the minimal and
maximal returns, and how long ago they occurred:

$$
\begin{aligned}
\phi_1(x) &= 1 \\
\phi_2(x) &= G(x) \\
\phi_3(x) &= \min_{i=1,\ldots,100} x(i) - 1 \\
\phi_4(x) &= \max_{i=1,\ldots,100} x(i) - 1 \\
\phi_5(x) &= \operatorname*{argmin}_{i=1,\ldots,100} x(i) - 1 \\
\phi_6(x) &= \operatorname*{argmax}_{i=1,\ldots,100} x(i) - 1
\end{aligned}
$$

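The next set of basis functions summarizes the characteristics of the basic shape of the 100 day
sample path. They are the inner products of the path with the Legendre polynomials of the first four degrees.
Let $j = i/50 - 1$. Then $\phi_7(x)$, $\phi_8(x)$, $\phi_9(x)$, and $\phi_{10}(x)$ are the inner products of the path $x(1), \ldots, x(100)$
with the Legendre polynomials of degree 0, 1, 2, and 3 in $j$, respectively (see [33, 34] for the exact expressions).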



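Nonlinear combinations of basis functions:

$$
\begin{aligned}
\phi_{11}(x) &= \phi_2(x)\,\phi_3(x) \\
\phi_{12}(x) &= \phi_2(x)\,\phi_4(x) \\
\phi_{13}(x) &= \phi_2(x)\,\phi_7(x) \\
\phi_{14}(x) &= \phi_2(x)\,\phi_8(x) \\
\phi_{15}(x) &= \phi_2(x)\,\phi_9(x) \\
\phi_{16}(x) &= \phi_2(x)\,\phi_{10}(x)
\end{aligned}
$$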
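Several of the hand-picked features are direct summaries of the 100-day state vector and are easy to compute. The sketch below covers $\phi_1$ through $\phi_6$ and the products $\phi_{11}$ and $\phi_{12}$ only; it reads "argmin $x(i) - 1$" as the 1-based index of the minimum, minus one, which is an interpretation rather than something the text states explicitly. The function name `hand_picked_features` is hypothetical.

```python
import numpy as np

def hand_picked_features(x):
    """phi_1..phi_6 plus the products phi_11 and phi_12 for a 100-day state x."""
    x = np.asarray(x, dtype=float)
    phi1 = 1.0                   # constant
    phi2 = x[-1]                 # reward G(x) = x(100)
    phi3 = x.min() - 1.0         # minimal return
    phi4 = x.max() - 1.0         # maximal return
    phi5 = float(np.argmin(x))   # 0-based index equals (1-based argmin) - 1
    phi6 = float(np.argmax(x))   # 0-based index equals (1-based argmax) - 1
    return np.array([phi1, phi2, phi3, phi4, phi5, phi6,
                     phi2 * phi3, phi2 * phi4])

# Toy state: a path rising linearly from 0.9 to 1.1.
features = hand_picked_features(np.linspace(0.9, 1.1, 100))
```

The Legendre-based features and their products with $\phi_2$ would be appended in the same way once their normalization is fixed.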



In order to improve our results, we added a large number of additional basis functions to these
hand-picked 16. PSTD will compress these features for us, so we can use as many additional basis
functions as we would like. First we defined 4 additional basis functions consisting of the inner
products of the 100 day sample path with the 5th and 6th Legendre polynomials, and we added the
corresponding nonlinear combinations of basis functions (the first two are shown up to constant normalization factors):

$$
\begin{aligned}
\phi_{17}(x) &\propto \sum_{i=1}^{100} x(i) \left( 35 j^4 - 30 j^2 + 3 \right) \\
\phi_{18}(x) &\propto \sum_{i=1}^{100} x(i) \left( 63 j^5 - 70 j^3 + 15 j \right) \\
\phi_{19}(x) &= \phi_2(x)\,\phi_{17}(x) \\
\phi_{20}(x) &= \phi_2(x)\,\phi_{18}(x)
\end{aligned}
$$

Finally we added the entire sample path and the squared sample path:

$$
\phi_{21:120}(x) = x_{1:100}, \qquad \phi_{121:220}(x) = x_{1:100}^{2}
$$